{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ec4ea046-ac11-4cb8-9531-b6a90fea6ecb",
   "metadata": {},
   "source": [
    "# L5: Prompt Compression\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b74f85b7-00a0-4915-b9f5-ea916357a4ba",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> ⏳ <b>Note <code>(Kernel Starting)</code>:</b> This notebook takes about 30 seconds to be ready to use. You may start and watch the video while you wait.</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f51fd497-83e5-49fa-abf9-4df7f70276f7",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "# 警告控制\n",
    "import warnings\n",
    "\n",
    "# 忽略所有警告信息，以免干扰程序输出\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f475605b-e4bc-4d6e-993a-b5bee2f60951",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "#pip install llmlingua"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1269d100-d775-44d0-9198-25a7194234f2",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": "import custom_utils  # 导入自定义工具库 custom_utils"
  },
  {
   "cell_type": "markdown",
   "id": "b136ae50-743e-4b4f-bede-201cdf68c13a",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px\"> 💻 &nbsp; <b>Access <code>requirements.txt</code> and <code>utils</code> files:</b> To access <code>requirements.txt</code> for this notebook, 1) click on the <em>\"File\"</em> option on the top menu of the notebook and then 2) click on <em>\"Open\"</em>. For more help, please see the <em>\"Appendix - Tips and Help\"</em> Lesson.</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad100d3a-2878-49d8-9149-7cf17b762ae3",
   "metadata": {},
   "source": [
    "## Data Loading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ba54c1f-108c-49f2-bcdb-234c06df3eaa",
   "metadata": {
    "height": 166
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "import pandas as pd\n",
    "\n",
    "# 加载数据集\n",
    "dataset = load_dataset(\"MongoDB/airbnb_embeddings\", streaming=True, split=\"train\")\n",
    "dataset = dataset.take(100)  # 取前100条数据\n",
    "\n",
    "# 将数据集转换为 pandas 数据框\n",
    "dataset_df = pd.DataFrame(dataset)\n",
    "\n",
    "# 显示前5条数据\n",
    "dataset_df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cef8a3ce-e87d-49f3-a935-530cb54d6f2e",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "# 打印数据框的列名\n",
    "print(\"Columns:\", dataset_df.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75cabb79-54bd-443f-8edc-ba80ec529be5",
   "metadata": {},
   "source": [
    "## Document Modelling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03e640da-4906-4ade-8f0e-9559f7239575",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "# 使用自定义工具库处理记录，并将结果存储在 listings 变量中\n",
    "listings = custom_utils.process_records(dataset_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d8dc7e1-d108-4a40-8294-4325b69fd5a4",
   "metadata": {},
   "source": [
    "## Database Creation and Connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d124791c-f015-440b-9d64-6cb7d6c7a6c9",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "# 使用自定义工具库连接到数据库，并获取数据库和集合对象\n",
    "db, collection = custom_utils.connect_to_database()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5865a1f-9db7-4eb4-b768-e3353d637121",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "# 删除集合中所有现有的记录\n",
    "collection.delete_many({})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d616ad62-e4b2-4086-b596-14320baa5db6",
   "metadata": {},
   "source": [
    "## Data Ingestion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b602c4d7-fd03-4fa4-be63-0b22b29724b7",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "# 插入处理后的记录到集合中\n",
    "collection.insert_many(listings)\n",
    "print(\"Data ingestion into MongoDB completed\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9567637d-9c54-4594-81b7-592f634eb7fa",
   "metadata": {},
   "source": [
    "## Vector Search Index defintion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "061b04e8-b279-41dc-80fc-d94fcd544af8",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "# 创建带有过滤器的向量搜索索引\n",
    "custom_utils.setup_vector_search_index_with_filter(collection=collection)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36804e3d-7cff-4b7d-9a6e-018806f1fdbf",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> ⏳ <b>Note:</b> If the output of the previous cell is <code>Error creating vector search index: Duplicate Index</code> you may proceed to the next cell if you intend to still use a previously created index.</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "678beaa8-f9bb-4f60-bf21-1e56916c6da9",
   "metadata": {},
   "source": [
    "## Handling User Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea395473-e021-487e-a794-280655380fb2",
   "metadata": {
    "height": 217
   },
   "outputs": [],
   "source": [
    "from pydantic import BaseModel\n",
    "from typing import Optional\n",
    "import custom_utils\n",
    "\n",
    "class SearchResultItem(BaseModel):\n",
    "    name: str  # 房源名称\n",
    "    accommodates: Optional[int] = None  # 可容纳人数，可选\n",
    "    address: custom_utils.Address  # 地址信息\n",
    "    neighborhood_overview: Optional[str] = None  # 邻里概况，可选\n",
    "    notes: Optional[str] = None  # 备注，可选\n",
    "    averageReviewScore: Optional[float] = None  # 平均评论得分，可选\n",
    "    number_of_reviews: Optional[float] = None  # 评论数量，可选\n",
    "    combinedScore: Optional[float] = None  # 综合评分，可选"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d50e3e1-92ae-4957-86e8-77f6936b7f73",
   "metadata": {},
   "source": [
    "## Boosting Search Results After Vector Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d86e9f0-5a7e-43fb-b855-ecf3ee2a46a5",
   "metadata": {
    "height": 387
   },
   "outputs": [],
   "source": [
    "# 定义计算平均评论得分和评论数量加权得分的聚合阶段\n",
    "review_average_stage = {\n",
    "    \"$addFields\": {\n",
    "        \"averageReviewScore\": {\n",
    "            \"$divide\": [\n",
    "                {\n",
    "                    \"$add\": [\n",
    "                        \"$review_scores.review_scores_accuracy\",\n",
    "                        \"$review_scores.review_scores_cleanliness\",\n",
    "                        \"$review_scores.review_scores_checkin\",\n",
    "                        \"$review_scores.review_scores_communication\",\n",
    "                        \"$review_scores.review_scores_location\",\n",
    "                        \"$review_scores.review_scores_value\",\n",
    "                    ]\n",
    "                },\n",
    "                6  # 除以评论评分类型的数量以获得平均值\n",
    "            ]\n",
    "        },\n",
    "        # 根据评论数量计算评分提升因子\n",
    "        \"reviewCountBoost\": \"$number_of_reviews\"\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ecb2cff-50de-4656-bc62-1586e9a06b9a",
   "metadata": {
    "height": 217
   },
   "outputs": [],
   "source": [
    "# 定义加权评分的聚合阶段\n",
    "weighting_stage = {\n",
    "    \"$addFields\": {\n",
    "        \"combinedScore\": {\n",
    "            # 结合平均评论得分和评论数量提升的示例公式\n",
    "            \"$add\": [\n",
    "                {\"$multiply\": [\"$averageReviewScore\", 0.3]},  # 加权平均评论得分\n",
    "                {\"$multiply\": [\"$reviewCountBoost\", 0.7]}   # 加权评论数量提升\n",
    "            ]\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6d230f8-324c-4466-8efc-e02e2ea3ff30",
   "metadata": {
    "height": 98
   },
   "outputs": [],
   "source": [
    "# 应用 combinedScore 进行排序的聚合阶段\n",
    "sorting_stage_sort = {\n",
    "    \"$sort\": {\"combinedScore\": -1}  # 按降序排列，以提升较高的综合评分\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0318142e-32a2-4bb8-90ec-c1d059d93b95",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "# 定义额外的聚合阶段，包括计算平均评论得分、加权评分和排序\n",
    "additional_stages = [review_average_stage, weighting_stage, sorting_stage_sort]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1c1d370-e016-4986-9668-f9da24ca62a4",
   "metadata": {},
   "source": [
    "## Modified Handling User Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "503662bb-7685-4e64-8523-ceed3330b04d",
   "metadata": {
    "height": 642
   },
   "outputs": [],
   "source": [
    "from IPython.display import display, HTML\n",
    "import pprint\n",
    "\n",
    "def handle_user_query(query, db, collection, stages=[], vector_index=\"vector_index_text\"):\n",
    "    \"\"\"\n",
    "    处理用户查询并返回系统响应和源信息。\n",
    "\n",
    "    Args:\n",
    "    query (str): 用户的查询字符串。\n",
    "    db (MongoClient.database): 数据库对象。\n",
    "    collection (MongoCollection): 要搜索的 MongoDB 集合。\n",
    "    stages (list): 额外的聚合阶段要包括在管道中。\n",
    "    vector_index (str): 向量索引名称，默认为 \"vector_index_text\"。\n",
    "\n",
    "    Returns:\n",
    "    str: 系统响应。\n",
    "    \"\"\"\n",
    "    # 执行向量搜索\n",
    "    get_knowledge = custom_utils.vector_search_with_filter(query, db, collection, stages, vector_index)\n",
    "\n",
    "    # 检查是否有结果\n",
    "    if not get_knowledge:\n",
    "        return \"No results found.\", \"No source information available.\"\n",
    "    \n",
    "    # 将搜索结果转换为 SearchResultItem 模型列表\n",
    "    search_results_models = [\n",
    "        SearchResultItem(**result)\n",
    "        for result in get_knowledge\n",
    "    ]\n",
    "\n",
    "    # 将搜索结果转换为 DataFrame 以便在 Jupyter 中更好地呈现\n",
    "    search_results_df = pd.DataFrame([item.dict() for item in search_results_models])\n",
    "\n",
    "    # 打印未压缩的提示（查询信息）\n",
    "    print(\"Uncompressed Prompt (Query Info):\\n\")\n",
    "    print(search_results_df)\n",
    "\n",
    "    # 使用 OpenAI 的 completion 生成系统响应\n",
    "    completion = custom_utils.openai.ChatCompletion.create(\n",
    "        model=\"gpt-3.5-turbo\",\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"system\", \n",
    "                \"content\": \"You are an Airbnb listing recommendation system.\"\n",
    "            },\n",
    "            {\n",
    "                \"role\": \"user\", \n",
    "                \"content\": f\"Answer this user query: {query} with the following context:\\n{search_results_df}\"\n",
    "            }\n",
    "        ]\n",
    "    )\n",
    "    system_response = completion.choices[0].message['content']\n",
    "\n",
    "    # 打印用户问题、系统响应和源信息\n",
    "    print(f\"- User Question:\\n{query}\\n\")\n",
    "    print(f\"- System Response:\\n{system_response}\\n\")\n",
    "\n",
    "    # 以 HTML 表格形式显示 DataFrame\n",
    "    display(HTML(search_results_df.to_html()))\n",
    "\n",
    "    # 返回结构化响应和源信息作为字符串\n",
    "    return system_response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43e186ba-c53e-484d-b311-31e68b8128f8",
   "metadata": {
    "height": 234
   },
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "I want to stay in a place that's warm and friendly, \n",
    "and not too far from restaurants, can you recommend a place? \n",
    "Include a reason as to why you've chosen your selection.\n",
    "\"\"\"\n",
    "\n",
    "# 处理用户查询并获取响应\n",
    "response = handle_user_query(\n",
    "    query, \n",
    "    db, \n",
    "    collection, \n",
    "    additional_stages, \n",
    "    vector_index=\"vector_index_with_filter\"\n",
    ")\n",
    "response"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b08a2cfe-22f2-4421-a1ec-d2f53bee2fcb",
   "metadata": {},
   "source": [
    "## Prompt Compression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70aa1f19-3a43-4800-8984-ea343dc5dfda",
   "metadata": {
    "height": 557
   },
   "outputs": [],
   "source": [
    "import json\n",
    "from llmlingua import PromptCompressor\n",
    "\n",
    "# 初始化 PromptCompressor 实例\n",
    "llm_lingua = PromptCompressor(\n",
    "    model_name=\"microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank\",\n",
    "    model_config={\"revision\": \"main\"},\n",
    "    use_llmlingua2=True,\n",
    "    device_map=\"cpu\",\n",
    ")\n",
    "\n",
    "# 定义压缩查询提示的函数\n",
    "def compress_query_prompt(query):\n",
    "    \"\"\"\n",
    "    压缩用户查询提示。\n",
    "\n",
    "    Args:\n",
    "    query (dict): 包含 demonstration_str, instruction 和 question 的字典。\n",
    "\n",
    "    Returns:\n",
    "    str: 压缩后的查询提示，以 JSON 字符串形式返回。\n",
    "    \"\"\"\n",
    "    demonstration_str = query['demonstration_str']\n",
    "    instruction = query['instruction']\n",
    "    question = query['question']\n",
    "\n",
    "    # 执行 6 倍压缩\n",
    "    compressed_prompt = llm_lingua.compress_prompt(\n",
    "        demonstration_str.split(\"\\n\"), \n",
    "        instruction=instruction,\n",
    "        question=question,\n",
    "        target_token=500,\n",
    "        rank_method=\"longllmlingua\", \n",
    "        context_budget=\"+100\",\n",
    "        dynamic_context_compression_ratio=0.4,\n",
    "        reorder_context=\"sort\",\n",
    "    )\n",
    "\n",
    "    return json.dumps(compressed_prompt, indent=4)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1aadfd10-c9ba-4f93-9f0d-b99771219e28",
   "metadata": {
    "height": 574
   },
   "outputs": [],
   "source": [
    "def handle_user_query_with_compression(query, db, collection, stages=[], vector_index=\"vector_index_text\"):\n",
    "    \"\"\"\n",
    "    处理用户查询并压缩提示。\n",
    "\n",
    "    Args:\n",
    "    query (str): 用户的查询字符串。\n",
    "    db (MongoClient.database): 数据库对象。\n",
    "    collection (MongoCollection): 要搜索的 MongoDB 集合。\n",
    "    stages (list): 额外的聚合阶段要包括在管道中。\n",
    "    vector_index (str): 向量索引名称，默认为 \"vector_index_text\"。\n",
    "\n",
    "    Returns:\n",
    "    tuple: 搜索结果的 DataFrame 和压缩后的提示。\n",
    "    \"\"\"\n",
    "    # 执行向量搜索以从数据库获取知识\n",
    "    get_knowledge = custom_utils.vector_search_with_filter(query, db, collection, stages, vector_index)\n",
    "\n",
    "    # 检查是否有结果\n",
    "    if not get_knowledge:\n",
    "        return None, \"No results found.\"\n",
    "\n",
    "    # 将搜索结果转换为 SearchResultItem 模型列表\n",
    "    search_results_models = [SearchResultItem(**result) for result in get_knowledge]\n",
    "\n",
    "    # 将搜索结果转换为 DataFrame 以便更好地呈现\n",
    "    search_results_df = pd.DataFrame([item.dict() for item in search_results_models])\n",
    "\n",
    "    # 准备压缩信息\n",
    "    query_info = {\n",
    "        'demonstration_str': search_results_df.to_string(),  # 信息检索过程中的结果\n",
    "        'instruction': \"Write a high-quality answer for the given question using only the provided search results.\",\n",
    "        'question': query  # 用户查询\n",
    "    }\n",
    "\n",
    "    # 使用预定义函数压缩查询提示\n",
    "    compressed_prompt = compress_query_prompt(query_info)\n",
    "\n",
    "    # 可选：打印压缩后的提示以进行调试\n",
    "    print(\"Compressed Prompt:\\n\")\n",
    "    pprint.pprint(compressed_prompt)\n",
    "    print(\"\\n\" + \"=\" * 80 + \"\\n\")\n",
    "\n",
    "    return search_results_df, compressed_prompt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "599d2704-4451-4db3-8e56-8eedb0de82f6",
   "metadata": {
    "height": 455
   },
   "outputs": [],
   "source": [
    "def handle_system_response(query, compressed_prompt):\n",
    "    \"\"\"\n",
    "    处理系统响应。\n",
    "\n",
    "    Args:\n",
    "    query (str): 用户的查询字符串。\n",
    "    compressed_prompt (str): 压缩后的提示信息。\n",
    "\n",
    "    Returns:\n",
    "    str: 系统响应。\n",
    "    \"\"\"\n",
    "    # 使用 OpenAI 的 completion 生成系统响应\n",
    "    completion = custom_utils.openai.ChatCompletion.create(\n",
    "        model=\"gpt-3.5-turbo\",\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"system\",\n",
    "                \"content\": \"You are an Airbnb listing recommendation system.\"\n",
    "            },\n",
    "            {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": f\"Answer this user query: {query} with the following context:\\n{compressed_prompt}\"\n",
    "            }\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    system_response = completion.choices[0].message['content']\n",
    "\n",
    "    # 打印用户问题和系统响应\n",
    "    print(f\"- User Question:\\n{query}\\n\")\n",
    "    print(f\"- System Response:\\n{system_response}\\n\")\n",
    "\n",
    "    # 返回系统响应\n",
    "    return system_response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0aca573-77f8-49dc-b232-1e991dbffd29",
   "metadata": {
    "height": 149
   },
   "outputs": [],
   "source": [
    "# 压缩查询并获取搜索结果\n",
    "results, compressed_prompt = handle_user_query_with_compression(\n",
    "    query, \n",
    "    db, \n",
    "    collection, \n",
    "    additional_stages, \n",
    "    vector_index=\"vector_index_with_filter\"\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebdb52e2-0bd2-41be-a523-5ad73cdc38fa",
   "metadata": {
    "height": 115
   },
   "outputs": [],
   "source": [
    "if compressed_prompt:\n",
    "    # 使用压缩后的提示处理系统响应\n",
    "    system_response = handle_system_response(query, compressed_prompt)\n",
    "else:\n",
    "    print(\"No valid results to display.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef642368-5cee-4b6d-ac11-252cfea3c757",
   "metadata": {},
   "source": [
    "## Additional Resouces\n",
    "\n",
    "- [The MongoDB Developer Center for tutorials, articles and videos](https://mdb.link/developer_center)\n",
    "\n",
    "- [The GenAI Showcase Repo for code showcasing building AI applications and demo](https://github.com/mongodb-developer/GenAI-Showcase)\n",
    "\n",
    "- [DeepLearning AI forum for question in regards to this course](https://community.deeplearning.ai/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "156ab944-4368-42fd-bdc8-fdc27eab591c",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6840a77e-6eb9-4bcf-a45d-708c8a3de213",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9f53e84-82c3-454f-96b8-35e74c683a02",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac627208-461a-4998-9fc2-4460d6020fe0",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
